This IPython notebook explains a basic workflow for matching two tables using py_entitymatching. Our goal is to match restaurants from the Fodors and Zagat sites; each dataset contains information about restaurants, such as their name, address, city, and phone.
First, we need to import the py_entitymatching package and other libraries as follows:
In [1]:
# Append the local py_entitymatching installation to the system path
# (needed only for a development install)
import sys
sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/py_entitymatching/')
import py_entitymatching as em
import pandas as pd
import os
In [2]:
# Display the versions
print('python version: ' + sys.version)
print('pandas version: ' + pd.__version__)
print('py_entitymatching version: ' + em.__version__)
Matching two tables typically consists of the following three steps:
1. Reading the input tables
2. Blocking the input tables to get a candidate set
3. Matching the tuple pairs in the candidate set
In [3]:
# Get the paths of the input tables
path_A = os.path.join(em.get_install_path(), 'datasets', 'end-to-end', 'restaurants', 'fodors.csv')
path_B = os.path.join(em.get_install_path(), 'datasets', 'end-to-end', 'restaurants', 'zagats.csv')
In [4]:
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e., the Cartesian product): ' + str(len(A)*len(B)))
In [6]:
A.head()
Out[6]:
In [7]:
B.head()
Out[7]:
In [8]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)
Out[8]:
Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This reduces the number of tuple pairs considered for matching. py_entitymatching provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. The user can mix and match these blockers to form a blocking sequence applied to the input tables, as sketched below.
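For example, one blocker can be applied to the input tables and a second blocker to the resulting candidate set. The sketch below is illustrative and not executed in this guide; the choice of blocking attributes is an assumption.
In [ ]:
# Illustrative blocking sequence (not run in this guide):
# first block on city equality, then keep only pairs whose names share a word
ab = em.AttrEquivalenceBlocker()
C1 = ab.block_tables(A, B, 'city', 'city',
                     l_output_attrs=['name', 'addr', 'city', 'phone'],
                     r_output_attrs=['name', 'addr', 'city', 'phone'])
ob = em.OverlapBlocker()
C2 = ob.block_candset(C1, 'name', 'name', overlap_size=1, show_progress=False)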
For the matching problem at hand, we know that two restaurants with no overlap between their names will not match. So we decide to apply blocking over names:
In [9]:
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, 'name', 'name',
l_output_attrs=['name', 'addr', 'city', 'phone'],
r_output_attrs=['name', 'addr', 'city', 'phone'],
overlap_size=1, show_progress=False)
In [10]:
C.head()
Out[10]:
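Before moving on, it is good practice to check whether blocking dropped any likely true matches. py_entitymatching provides a blocker debugger for this; a minimal sketch (the output_size of 200 is an arbitrary choice):
In [ ]:
# Examine tuple pairs that look similar but are absent from the candidate set C
D = em.debug_blocker(C, A, B, output_size=200)
D.head()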
In this step, we match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching. This typically involves the following four steps:
1. Sampling and labeling a subset of the candidate set
2. Creating a set of features
3. Training a matcher using the labeled data
4. Using the trained matcher to predict matches in the candidate set
First, we randomly sample 450 tuple pairs for labeling purposes.
In [11]:
# Sample candidate set
S = em.sample_table(C, 450)
Next, we label the sampled candidate set. Specifically, we enter 1 for a match and 0 for a non-match.
In [12]:
# Label S
G = em.label_table(S, 'gold')
For the purposes of this guide, we will load in a pre-labeled dataset (of 450 tuple pairs) included in this package.
In [13]:
path_G = os.path.join(em.get_install_path(), 'datasets', 'end-to-end', 'restaurants', 'lbl_restnt_wf1.csv')
G = em.read_csv_metadata(path_G,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
len(G)
Out[13]:
First, we need to create a set of features. py_entitymatching provides a way to automatically generate features based on the attributes in the input tables. For the purposes of this guide, we use the automatically generated features.
In [14]:
# Generate features automatically
feature_table = em.get_features_for_matching(A, B)
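The generated feature table can be inspected (e.g., feature_table['feature_name']), and custom features can be added if needed. Below is a minimal sketch of adding a Jaccard feature over 3-gram tokenized names; the feature string and the feature name are illustrative, assuming the declarative feature syntax documented for py_entitymatching.
In [ ]:
# Sketch: declare a custom feature (Jaccard similarity over 3-gram tokens of 'name')
tok = em.get_tokenizers_for_matching()
sim = em.get_sim_funs_for_matching()
feature_string = "jaccard(qgm_3(ltuple['name']), qgm_3(rtuple['name']))"
feature = em.get_feature_fn(feature_string, tok, sim)
# Register the feature in the feature table (modified in place)
em.add_feature(feature_table, 'name_name_jac_qgm_3_qgm_3', feature)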
Next, we convert the labeled data to feature vectors using the feature table.
In [15]:
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_name', 'ltable_addr', 'ltable_city', 'ltable_phone',
'rtable_name', 'rtable_addr', 'rtable_city', 'rtable_phone']
# Convert the labeled data to feature vectors using the feature table
H = em.extract_feature_vecs(G,
                            feature_table=feature_table,
                            attrs_before=attrs_from_table,
                            attrs_after='gold',
                            show_progress=False)
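Note that the feature vectors may contain missing values if some attribute values in the input tables are missing. Whether imputation is needed here depends on the data; a sketch using em.impute_table with a mean strategy (the string attributes must be excluded from imputation):
In [ ]:
# Sketch: check for missing values and, if any, impute them with column means
print(H.isnull().values.any())
H = em.impute_table(H,
                    exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'gold'] + attrs_from_table,
                    strategy='mean')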
Then, we train the learning-based matcher using the feature vectors. For the purposes of this guide, we will use the Random Forest matcher that is included in the py_entitymatching package.
In [16]:
# Instantiate the RF Matcher
rf = em.RFMatcher()
In [17]:
# Get the attributes to be excluded while training
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id', 'gold'])
attrs_to_be_excluded.extend(attrs_from_table)
In [18]:
# Train using feature vectors from the labeled data.
rf.fit(table=H, exclude_attrs=attrs_to_be_excluded, target_attr='gold')
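Before applying the trained matcher to the full candidate set, one may want to estimate its accuracy on held-out labeled data. A minimal sketch using a train/test split of H (the 0.7 train proportion is an arbitrary choice; this step is not part of this guide's minimal flow):
In [ ]:
# Sketch: estimate matcher accuracy with a train/test split of the labeled data
IJ = em.split_train_test(H, train_proportion=0.7, random_state=0)
I, J = IJ['train'], IJ['test']
rf_eval = em.RFMatcher()
rf_eval.fit(table=I, exclude_attrs=attrs_to_be_excluded, target_attr='gold')
P = rf_eval.predict(table=J, exclude_attrs=attrs_to_be_excluded,
                    append=True, target_attr='predicted', inplace=False)
result = em.eval_matches(P, 'gold', 'predicted')
em.print_eval_summary(result)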
Now, we use the trained matcher to predict matches in the candidate set. To do that, first we need to convert the candidate set to feature vectors.
In [19]:
# Select the attrs. to be included in the feature vector table
attrs_from_table = ['ltable_name', 'ltable_addr', 'ltable_city', 'ltable_phone',
'rtable_name', 'rtable_addr', 'rtable_city', 'rtable_phone']
# Convert the candidate set to feature vectors using the feature table
L = em.extract_feature_vecs(C, feature_table=feature_table,
                            attrs_before=attrs_from_table,
                            show_progress=False)
Next, we predict the matches in the candidate set using the trained matcher and the feature vectors.
In [20]:
# Get the attributes to be excluded while predicting
attrs_to_be_excluded = []
attrs_to_be_excluded.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_to_be_excluded.extend(attrs_from_table)
In [21]:
# Predict the matches
predictions = rf.predict(table=L, exclude_attrs=attrs_to_be_excluded,
append=True, target_attr='predicted', inplace=False)
In [22]:
predictions.head()
Out[22]:
Finally, we project the attributes of interest and the predictions from the predicted table.
In [23]:
# Get the attributes to be projected out
attrs_proj = []
attrs_proj.extend(['_id', 'ltable_id', 'rtable_id'])
attrs_proj.extend(attrs_from_table)
attrs_proj.append('predicted')
# Project the attributes
predictions = predictions[attrs_proj]
In [24]:
predictions.head()
Out[24]:
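The predicted matches can now be filtered and saved for downstream use; a short sketch using plain pandas (the output path is illustrative):
In [ ]:
# Keep only the pairs predicted to match and write them to disk
matches = predictions[predictions['predicted'] == 1]
matches.to_csv('./matches.csv', index=False)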